Gapless genome assembly of Fusarium verticillioides, a filamentous fungus threatening plant and human health

Fusarium verticillioides is a filamentous fungus that causes plant diseases and harms human health through cancer-inducing mycotoxins and life-threatening fusariosis. Given its threat to agriculture and public health, genome assembly of this fungus is critical to understanding its pathobiology and to developing antifungal drugs. Here, we report a gap-free genome assembly of F. verticillioides built from PacBio HiFi data and high-throughput chromosome conformation capture (Hi-C) sequencing data. The assembled 42.0 Mb sequence contains eleven gapless chromosomes capturing all centromeres and 19 of the 22 telomeres. This assembly represents a significant improvement over the previous version in contiguity (contig N50: 4.3 Mb), completeness (BUSCO score: 99.0%) and correctness (QV: 88.8). A total of 15,230 protein-coding genes were predicted, 6.2% of which are newly annotated. In addition, we identified three-dimensional chromatin structures such as TAD-like structures and chromatin loops based on ultra-high-coverage Hi-C data. This gap-free genome of F. verticillioides is an excellent resource for a more comprehensive understanding of the mechanisms of fungal genome evolution, mycotoxin production and pathogenesis on plant and human hosts.


Background & Summary
Fusarium verticillioides, a filamentous fungus belonging to the Fusarium fujikuroi species complex, causes Fusarium ear rot of maize, a major crop worldwide. Besides yield loss, various mycotoxins are produced during fungal infection of maize, reducing the quality of corn products. The best characterized F. verticillioides mycotoxins are fumonisins, a group of polyketide mycotoxins associated with esophageal cancer and neural tube birth defects in human populations consuming contaminated maize products 1. Although F. verticillioides is considered nonpathogenic to healthy humans, it can become a serious threat to individuals with a compromised immune system, such as those undergoing organ transplants 2. Human infection by F. verticillioides, commonly known as fusariosis, has become a growing threat to the lives of immunocompromised patients because of the limited options of antifungal drugs for treatment and the emergence of multi-drug resistance 3. Therefore, elucidating the molecular mechanisms underlying fungal pathogenesis and antifungal resistance in F. verticillioides is crucial to both agricultural safety and public health.
(2023) 10:229 | https://doi.org/10.1038/s41597-023-02145-8 | www.nature.com/scientificdata
Despite the importance of this fungus, its complete genome sequence has not been assembled and thoroughly analyzed, impeding dissection of the molecular and evolutionary mechanisms underlying its pathogenesis, secondary metabolism and drug resistance. The first genome assembly of F. verticillioides strain 7600 was released in 2010 4 with a contig N50 of 392.3 kb. More recently, several updated F. verticillioides genome assemblies have become available in the NCBI (National Center for Biotechnology Information) genome database. Although these genome assemblies have facilitated genetic studies of fungal biological processes, they are highly fragmented, with several hundred to thousands of contigs. Because F. verticillioides has 11 chromosomes, this fragmentation indicates the presence of gaps in these assembly versions. Furthermore, no telomere or centromere sequences have been reported in any available F. verticillioides genome assembly, leaving these essential and complex genomic regions unexplored. A complete genome sequence for F. verticillioides would enable accurate characterization of fungal genome function, regulation and evolution, shedding light on mechanisms of growth, development, pathogenicity and mycotoxin production.
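Contig N50, the contiguity statistic quoted above, is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions and are not taken from any assembly discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly length defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy contig lengths, chosen only to make the arithmetic easy.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 50 + 40 = 90 >= 75, so the N50 is 40
```

A fragmented assembly with many short contigs yields a small N50, whereas an assembly made of a few chromosome-length contigs yields an N50 on the order of a chromosome length.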
We annotated the gap-free assembly to predict protein-coding genes and repeat elements. To assess how much a nearly complete genome sequence improves annotation, we applied the same annotation pipeline and RNA-seq data to both our assembly and the previous version, GCA_000149555.1. For protein-coding genes, the two assemblies were comparable: our assembly encodes 15,230 genes, a slight increase (+6.2%) over the previous assembly (Table 3; Fig. 1c). Comparing the two annotations revealed 15,056 genes present in both assemblies, while 75 and 174 genes were uniquely annotated in the previous assembly and in our assembly, respectively. The new genome assembly contains 2.8% (1,164,494 bp) repeat content, higher than the previous version (1.7%, 708,545 bp). Specifically, our assembly contains 120,266 bp of LTR (long terminal repeat) elements (+102.9%) and 102,640 bp of DNA transposons (+2,608.2%) (Table 3).
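The shared and assembly-specific gene counts above can be tallied as a two-set comparison. The sketch below uses only the three counts given in the text; the function name is an illustrative assumption, and the total for the previous annotation is derived here rather than quoted from the text.

```python
def annotation_totals(shared, only_previous, only_new):
    """Total genes per annotation from shared and annotation-specific counts."""
    return {
        "previous": shared + only_previous,
        "new": shared + only_new,
    }


# Counts as stated in the text: 15,056 shared genes, 75 unique to the
# previous annotation and 174 unique to the new annotation.
totals = annotation_totals(shared=15_056, only_previous=75, only_new=174)
print(totals)
```

The "new" total reproduces the 15,230 genes reported for the gap-free assembly.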
In contrast to previous genome assemblies, this gap-free assembly of F. verticillioides contains the centromeres of all 11 chromosomes (Fig. 2a), thanks to the highly accurate HiFi sequence data and improved assembly algorithms. To validate the centromere regions, we mapped the HiFi reads and RNA-seq reads to the gap-free assembly. HiFi read coverage was consistent throughout the assembly, including the centromeres (Fig. 2b,c) and telomeres (Fig. 2d,e), regions that contained no protein-coding genes and showed little RNA-seq alignment. By comparing this assembly with a previous assembly (GCA_000149555.1), we showed that numerous gaps were closed and that three large inversions on the short arms of Chr3, Chr10 and Chr11 were corrected in the new assembly (Fig. 3a). Furthermore, scaffolds that were unplaced in GCA_000149555.1 are now anchored to their correct chromosome positions (Fig. 3b). The gapless assembly contains a total of 890 kb of new sequence, ranging from 25 kb to 231 kb per chromosome, that was absent from the GCA_000149555.1 chromosomes (Fig. 3c).
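Telomeres in this assembly are reported in Technical Validation as TTAGGG repeats identified with trf. As a rough illustration of what a telomere-bearing chromosome end looks like at the sequence level, the sketch below checks both ends of a sequence for a tandem run of the motif or its reverse complement. It uses a plain regular expression instead of trf, and the function name, thresholds and toy sequence are all illustrative assumptions.

```python
import re


def telomere_repeat_ends(seq, motif="TTAGGG", min_copies=3, window=60):
    """Report whether each end of seq carries a tandem run of the motif.

    The reverse complement of the motif is also accepted, because the repeat
    reads as CCCTAA at one end of a chromosome and TTAGGG at the other.
    """
    complement = str.maketrans("ACGT", "TGCA")
    rev_comp = motif.translate(complement)[::-1]
    pattern = re.compile(
        r"(?:%s){%d,}|(?:%s){%d,}" % (motif, min_copies, rev_comp, min_copies)
    )
    seq = seq.upper()
    return {
        "start": bool(pattern.search(seq[:window])),
        "end": bool(pattern.search(seq[-window:])),
    }


# Hypothetical toy sequence: telomeric repeats at the start only.
toy_seq = (
    "CCCTAA" * 5
    + "ACGTACGTAGCTAGCTAGGATCCGATCGATTACG" * 3
    + "GATTACA" * 4
)
result = telomere_repeat_ends(toy_seq)
print(result)
```

For the toy sequence, only the start is flagged. A dedicated tool such as trf additionally tolerates mismatches and indels within repeats, which this exact-match sketch does not.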
Lastly, we analyzed the three-dimensional genome of F. verticillioides based on Hi-C sequencing data generated from fungal mycelia collected from culture. A total of 53.8 Gb (1272.8X coverage) of Hi-C data contained 95.8% valid interaction pairs after initial quality control (Fig. 4a), from which we identified 60 TAD (topologically associating domain)-like structures and five chromatin loops at 10 kb resolution (Fig. 4d). This gap-free genome assembly and updated annotation of F. verticillioides are excellent resources for studying the mechanisms of fungal genome evolution, mycotoxin production and pathogenesis on plant and human hosts.
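Fold coverage is the number of sequenced bases divided by the genome size. The sketch below applies this to the Hi-C yield quoted above, assuming the rounded 42.0 Mb assembly size; because of that rounding, the result is close to, but not identical to, the reported 1272.8X. The function and constant names are illustrative.

```python
def fold_coverage(sequenced_bases, genome_size):
    """Average sequencing depth: total bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return sequenced_bases / genome_size


# Assumption: the rounded assembly size of 42.0 Mb is used as genome size.
GENOME_SIZE_BP = 42.0e6
HIC_BASES = 53.8e9  # 53.8 Gb of Hi-C data, as stated in the text

depth = fold_coverage(HIC_BASES, GENOME_SIZE_BP)
print(f"{depth:.0f}X")  # about 1281X with the rounded genome size
```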

Methods
Fungal culture, DNA preparation and PacBio HiFi sequencing. F. verticillioides strain 7600 was routinely maintained on PDA (potato dextrose agar) slants and stored in a −80 °C freezer. F. verticillioides 7600 mycelia and spores were harvested from two-day-old PDB (potato dextrose broth) cultures grown at 25 °C in a shaker at 150 rpm and used to isolate high-molecular-weight DNA by the CTAB (cetyltrimethylammonium bromide) method 11. A total of 15 µg of purified genomic DNA was used to construct a standard PacBio SMRTbell library with the PacBio SMRT Express Template Prep Kit 2.0 (Pacific Biosciences, CA). Sequencing was performed on a PacBio Sequel II instrument at Biomarker Technologies Corporation (QingDao, China).
Hi-C sequencing and analysis. The Hi-C library of F. verticillioides was prepared from cross-linked chromatin of fungal mycelia using a standard Hi-C protocol 12. The library was first sequenced in a test run and examined for its ratio of valid interaction read pairs using HiC-Pro (v.3.1.0) 13 before proceeding to high-coverage sequencing. The library was sequenced on an Illumina NovaSeq 6000 to yield 10.5 Gb (249.7X coverage) of paired-end reads. The valid interaction pairs of Hi-C sequencing reads were used to anchor all contigs using Juicer (v.1.5) 9, followed by the 3D-DNA correction pipeline 10 and manual correction with Juicebox (v.1.11.08) 14. A/B compartments were analyzed using HiTC (v.1.40.1) 15 and Cworld-dekker (https://github.com/dekkerlab/cworld-dekker), and TAD-like structures and chromatin loops were identified with Juicer (v.1.5) 9. The three-dimensional structure of the whole genome was visualized using pyGenomeTracks (v.3.7) 16.
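Hi-C analyses of this kind operate on contact counts between fixed-size genomic bins. The sketch below shows the binning idea for read pairs on a single chromosome; it is not the implementation used by HiC-Pro or Juicer. The function name and toy coordinates are illustrative assumptions, and the 10 kb default simply echoes the resolution mentioned in the Background.

```python
from collections import Counter


def bin_contacts(pairs, resolution=10_000):
    """Count Hi-C contacts per (bin_i, bin_j) pair, with bin_i <= bin_j.

    pairs: iterable of (pos_a, pos_b) coordinates on the same chromosome.
    """
    counts = Counter()
    for pos_a, pos_b in pairs:
        # Integer division assigns each read end to its bin; sorting makes
        # the contact map symmetric (upper triangle only).
        i, j = sorted((pos_a // resolution, pos_b // resolution))
        counts[(i, j)] += 1
    return counts


# Hypothetical read-pair coordinates (bp) on one chromosome.
toy_pairs = [(1_200, 15_800), (9_999, 10_000), (25_000, 3_000), (14_000, 2_000)]
contacts = bin_contacts(toy_pairs)
print(dict(contacts))
```

Smaller bins give finer resolution but require more valid pairs per bin to produce stable counts, which is one reason high-coverage Hi-C data are useful for detecting loops and TAD-like structures.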

Data records
The raw PacBio HiFi sequencing data, Hi-C data and RNA-seq data have been deposited in the National Center for Biotechnology Information (NCBI) under BioProject PRJNA868307 35 with the accession numbers SRR21003521 36, SRR21003520 37 and SRR21003519 38, respectively. The gap-free genome assembly is deposited under the same BioProject at NCBI (GCA_027571605.1) and also in the Genome Warehouse of the National Genomics Data Center (https://ngdc.cncb.ac.cn/) at the China National Center for Bioinformation under the accession number GWHBQEB00000000 39.

Technical Validation
Manual adjustment of misjoins and detection of potentially contaminated sequences. To obtain a nearly complete and error-free nuclear genome, we first manually corrected the assembly using Hi-C read alignments within Juicebox 14. We then aligned the species' mitochondrial genome to our assembly with megaBLAST 24, which found no errors. Finally, we used megaBLAST 24 to align our genome assembly against a common contaminant database (ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/contam_in_euks.fa.gz), a sequencing adaptor database (ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/adaptors_for_screening_euks.fa) and the nucleotide sequence database (remote mode) to identify potentially contaminated sequences; again, no contamination was found.
Evaluation of the genome assembly. The genome assembly was validated by two independent methods. First, HiFi reads were mapped to the assembly using Winnowmap2 (v.2.03) 41 (parameters: -W repetitive_k15.txt -t 104 -ax map-pb) and the quality value (QV) was assessed using Merqury (v.1.3) 42 (parameters: k = 18 count). Second, BUSCO (Benchmarking Universal Single-Copy Ortholog) 43 analysis was conducted to assess the completeness of the genome assembly. The final F. verticillioides gap-free genome assembly has a QV of 88.8, a completeness of 99.7% and a BUSCO score of 99%, indicating high accuracy and completeness (Table 2).
Validation of the genome assembly. The resolved fungal telomere and centromere regions are well covered by PacBio HiFi reads that span these complex regions, as visualized in IGV (v.2.4.10) 44 (Fig. 2). This assembly reduces the total gap length from 90,816 in the previous version to 0 and captures eleven centromeres and nineteen telomeres (TTAGGG repeats). Telomeres were searched for in the assemblies and raw sequences using trf (v.4.09.1) 45 (parameters: 2 7 7 80 10 90 2000 -d -m -l 2); three telomeres remain missing, one each at the end of Chr2 and Chr4 (Fig. 3). A one-to-one correspondence between the old and new versions of the genome was established for 14,260 coding-region genes using Liftoff (v.1.6.3) 46 and BEDtools (v.2.30.0) 47 (parameters: intersect -wa -wb -f 1.0), which account for 99.5% of the old-version genome and 93.6% of the genome from this study. Compared to the previous version (NCBI: GCA_000149555.1), our assembly corrects three major inversions located on the short arms of Chr3, Chr10 and Chr11, as visualized in a GenomeSyn 48 plot (Fig. 3).
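The QV reported above is a Phred-scaled consensus quality, related to the estimated per-base error rate E by QV = −10·log10(E). The sketch below converts between the two and, assuming the rounded 42.0 Mb assembly size, estimates the expected number of erroneous bases; the function and variable names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(rate):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(rate)


rate = qv_to_error_rate(88.8)  # QV reported for this assembly
# Assumption: rounded assembly size of 42.0 Mb.
expected_errors = rate * 42.0e6
print(f"error rate ~ {rate:.2e} per base")
print(f"expected erroneous bases ~ {expected_errors:.3f}")
```

At this QV, the implied error rate is on the order of one error per billion bases, so fewer than one erroneous base is expected across the whole assembly.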

Code availability
All software used in this study is in the public domain, with parameters clearly described in Methods and in this section. Where no detailed parameters are mentioned for a software tool, the default parameters suggested by the developer were used.

Table 4. Genomic coordinates of chromatin loops in the Fusarium verticillioides genome. X and Y represent the corresponding regions connected by a chromatin loop structure, where 1 and 2 mark the start and end of the region, respectively.